Detecting Luggage and People on Airport Escalators (Demo)

By: Adrian Avram and Abigail Batinga

Description of Project:

  • The Problem: Passengers bringing dangerous or unwieldy objects onto escalators is a problem at airports around the world, yet it has not been addressed with much concern. This is alarming because the chances of accidents occurring on escalators keep growing. A simple approach would be to post a sign at the entrance of each escalator showing which objects may not be taken aboard, but this does not guarantee that all passengers will comply. Another, more direct approach would be to place an employee at the entrance of each escalator to "screen" passengers, but this seems unnecessary and inefficient in terms of both staffing and passenger flow. This motivates a new approach: automation. And what better way to explore automating this process than the ubiquitous field of machine learning?


  • The Approach: The group of machine learning interns split into teams of two and took two approaches to this issue in order to compare applicability and accuracy. The first team took a fundamental approach using binary image classification. We were given a dataset of escalator images from the SeaTac airport, captured between Wednesday, March 20, 2019 at 10:52 AM and Thursday, March 21, 2019 at 9:06 AM. My team decided to address potential concerns with the binary classification model by framing the task as object detection rather than binary classification. Because an object detection model is trained on the objects themselves rather than the entire image and scene, it transfers more readily to different locations.


  • Tools Used: For this model, my teammate and I relied on the TensorFlow Object Detection API and its detection model zoo, which have well-documented accuracy benchmarks. We decided to use the "Faster R-CNN Inception V2" model.

Further Applications:

  • Another problem arising at SeaTac Airport is indoor congestion. This congestion could be mitigated by directing traffic so that it is distributed relatively evenly, avoiding build-ups at particular locations. By detecting and counting people as well as luggage at each escalator, patterns in traffic can be identified in relation to time and location. Visualizations of these patterns are shown later in this notebook.
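As a concrete sketch of the counting step, the snippet below tallies confident "person" detections for one frame. The shape of the input mirrors the per-detection class and score arrays returned by the TensorFlow Object Detection API, but the class id and the example values here are made up for illustration.

```python
PERSON_CLASS_ID = 1  # assumed label-map id for "person" in this sketch

def count_people(detection_classes, detection_scores, min_score=0.5):
    """Count detections labelled as a person whose confidence clears min_score."""
    return sum(1 for c, s in zip(detection_classes, detection_scores)
               if c == PERSON_CLASS_ID and s >= min_score)

# Example frame: three confident people, one low-confidence person, one luggage.
classes = [1, 1, 1, 1, 2]
scores = [0.98, 0.91, 0.77, 0.30, 0.88]
print(count_people(classes, scores))  # 3
```

Aggregating this count per frame, keyed by each frame's capture time, produces the traffic-versus-time series used in the visualizations.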

Note:

  • This notebook simply shows inference run on a few images, along with the visualizations relating traffic, location, and time. It does not show the full process of training the model, because setting up training required little custom code, since the API is pre-built.

Breakdown of How Faster RCNN Works

Feature Extraction

  • Image is preprocessed and input into a pretrained feature extractor (VGG-16)
  • Starts by extracting low-level details such as edges
  • Each filter adds another layer of depth
  • Subsequent layers extract patterns or high-level details from the image
  • Relative location of features remains constant

Anchors

  • A list of various sizes and ratios is defined
  • For each size, a bounding box for each ratio is created with the anchor as the center
    • Each color depicts a different size with the varying length to width ratios
  • Anchors are evenly spread throughout the entire image (With VGG-16 they are 16 pixels apart)
  • Anchors can be customized to suit needs
    • For a model focusing on classifying people, small vertical rectangles may be more useful and speed up training
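The anchor layout described above can be sketched in a few lines. The sizes and ratios below are the common defaults from the Faster R-CNN paper, not necessarily the ones this model was trained with.

```python
import math

def generate_anchors(img_h, img_w, stride=16,
                     sizes=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Place one anchor centre every `stride` pixels and attach a box for
    every (size, ratio) combination, centred on the anchor."""
    boxes = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for size in sizes:
                for ratio in ratios:
                    # Keep the area near size**2 while varying width:height.
                    w = size * math.sqrt(ratio)
                    h = size / math.sqrt(ratio)
                    boxes.append((cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2))
    return boxes

anchors = generate_anchors(720, 720)
print(len(anchors))  # 45 * 45 centres x 9 boxes = 18225
```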

Region Proposal Network (RPN)

  • Convolutional neural networks produce output of a fixed shape
  • Bounding boxes must be formed for various regions of interest, but the number of regions cannot be known in advance
  • Solution: have every anchor be evaluated by the region proposal network
    • May seem inefficient at first, but is in fact a fast and accurate way to evaluate every region
  • The convolutional feature map is input to the RPN, which outputs two sets of values: one predicts the probability that the region contains an object (background vs. foreground), and the other predicts offset values that produce a more precise bounding box
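A minimal sketch of that two-headed output, with the 1x1 convolutions reduced to a per-position matrix multiply and random stand-in weights (the real RPN's weights are learned):

```python
import numpy as np

rng = np.random.default_rng(0)

def rpn_head(feature_map, k=9):
    """Evaluate every anchor position: a 1x1 convolution is just a
    per-position matrix multiply over the feature map's depth axis."""
    h, w, depth = feature_map.shape
    w_cls = rng.standard_normal((depth, 2 * k))  # stand-in learned weights
    w_reg = rng.standard_normal((depth, 4 * k))
    flat = feature_map.reshape(-1, depth)
    objectness = flat @ w_cls   # background-vs-foreground score per anchor
    box_deltas = flat @ w_reg   # (dx, dy, dw, dh) refinement per anchor
    return objectness.reshape(h, w, 2 * k), box_deltas.reshape(h, w, 4 * k)

# 45x45 feature map, e.g. a 720x720 image after VGG-16's stride-16 downsampling
fmap = rng.standard_normal((45, 45, 512))
cls_out, reg_out = rpn_head(fmap)
print(cls_out.shape, reg_out.shape)  # (45, 45, 18) (45, 45, 36)
```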

Region of Interest Pooling Layer (ROI Pooling)

  • Each proposed region is a different size, and none of the regions has yet been assigned to a class (although in this case each region can only belong to one class, which simplifies computation)
  • Each proposal is cropped out of the convolutional feature map generated earlier by VGG-16 and resized to 14x14xDepth
  • A max pooling layer is then applied to each cropped section, filtering out unimportant details and standardizing the final input size to 7x7xDepth
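Real implementations typically do the crop-and-resize with an op such as TensorFlow's tf.image.crop_and_resize; the max-pooling step from 14x14 down to 7x7 can be sketched directly in NumPy:

```python
import numpy as np

def roi_max_pool(crop):
    """Max-pool a 14x14xDepth crop down to the standardized 7x7xDepth input."""
    h, w, d = crop.shape  # expects (14, 14, depth)
    # Group the pixels into 2x2 blocks and keep the maximum of each block.
    return crop.reshape(h // 2, 2, w // 2, 2, d).max(axis=(1, 3))

rng = np.random.default_rng(0)
crop = rng.standard_normal((14, 14, 512))  # one resized proposal crop
print(roi_max_pool(crop).shape)  # (7, 7, 512)
```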

Region Based Convolutional Neural Network (R-CNN)

  • Final step is to classify each of the proposals
    • The R-CNN adds an extra background class to eliminate empty bounding boxes
  • Further refines the bounding boxes produced by the RPN
  • The feature map for each proposal is flattened before being input into fully connected dense layers
  • Final output is N + 1 class scores (+1 for the background class), along with 4 × N offsets to refine each bounding box
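The head can be sketched as a flatten followed by dense layers, here with random stand-in weights and a reduced depth to keep the example small. With this project's two classes (person and luggage), N = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES = 2  # this project's classes: person and luggage

def rcnn_head(pooled_roi, hidden_units=256):
    """Flatten one pooled proposal, pass it through a dense ReLU layer, and
    emit N+1 class logits (+1 for background) plus 4 offsets per class."""
    flat = pooled_roi.reshape(-1)
    w_fc = rng.standard_normal((flat.size, hidden_units))  # stand-in weights
    hidden = np.maximum(flat @ w_fc, 0)                    # ReLU
    w_cls = rng.standard_normal((hidden_units, N_CLASSES + 1))
    w_box = rng.standard_normal((hidden_units, 4 * N_CLASSES))
    return hidden @ w_cls, hidden @ w_box

roi = rng.standard_normal((7, 7, 64))  # depth reduced to keep the example small
class_logits, box_offsets = rcnn_head(roi)
print(class_logits.shape, box_offsets.shape)  # (3,) (8,)
```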

Complete Diagram and Summary

  • Image is inputted into a feature extractor where a convolutional feature map is generated
  • The convolutional feature map has anchors added on and is sent to a RPN
  • The region proposal network determines whether each anchor is background or foreground and refines the bounding boxes
  • RoIP standardizes the inputs and sends them to the R-CNN
  • R-CNN outputs classification predictions and final bounding box locations

Why Faster R-CNN?

| Model name | Speed (ms) | COCO mAP |
| --- | --- | --- |
| ssd_mobilenet_v1_coco | 30 | 21 |
| ssd_mobilenet_v1_0.75_depth_coco | 26 | 18 |
| ssd_mobilenet_v1_quantized_coco | 29 | 18 |
| ssd_mobilenet_v1_0.75_depth_quantized_coco | 29 | 16 |
| ssd_mobilenet_v1_ppn_coco | 26 | 20 |
| ssd_mobilenet_v1_fpn_coco | 56 | 32 |
| ssd_resnet_50_fpn_coco | 76 | 35 |
| ssd_mobilenet_v2_coco | 31 | 22 |
| ssd_mobilenet_v2_quantized_coco | 29 | 22 |
| ssdlite_mobilenet_v2_coco | 27 | 22 |
| ssd_inception_v2_coco | 42 | 24 |
| faster_rcnn_inception_v2_coco | 58 | 28 |
| faster_rcnn_resnet50_coco | 89 | 30 |
| faster_rcnn_resnet50_lowproposals_coco | 64 | -- |
| rfcn_resnet101_coco | 92 | 30 |
| faster_rcnn_resnet101_coco | 106 | 32 |
| faster_rcnn_resnet101_lowproposals_coco | 82 | -- |
| faster_rcnn_inception_resnet_v2_atrous_coco | 620 | 37 |
| faster_rcnn_inception_resnet_v2_atrous_lowproposals_coco | 241 | -- |
| faster_rcnn_nas | 1833 | 43 |
| faster_rcnn_nas_lowproposals_coco | 540 | -- |
| mask_rcnn_inception_resnet_v2_atrous_coco | 771 | 36 |
| mask_rcnn_inception_v2_coco | 79 | 25 |
| mask_rcnn_resnet101_atrous_coco | 470 | 33 |
| mask_rcnn_resnet50_atrous_coco | 343 | 29 |
  • Variety of models to choose from
  • Faster R-CNN is well known and was developed at Microsoft Research
  • Strikes a good balance in the speed vs. accuracy tradeoff
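That tradeoff can be made concrete: given a per-frame latency budget, pick the most accurate model that fits. The rows below are copied from the table above.

```python
# A few rows from the model-zoo table above: name -> (speed_ms, coco_map)
ZOO = {
    "ssd_mobilenet_v2_coco": (31, 22),
    "ssd_inception_v2_coco": (42, 24),
    "faster_rcnn_inception_v2_coco": (58, 28),
    "faster_rcnn_resnet101_coco": (106, 32),
    "faster_rcnn_nas": (1833, 43),
}

def best_under_budget(zoo, max_ms):
    """Return the most accurate model whose reported speed fits the budget."""
    candidates = {name: stats for name, stats in zoo.items()
                  if stats[0] <= max_ms}
    return max(candidates, key=lambda name: candidates[name][1])

print(best_under_budget(ZOO, max_ms=60))  # faster_rcnn_inception_v2_coco
```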

Fine Tuning (aka Transfer Learning)

  • The feature extractor is pretrained on ImageNet, which has 1.2 million images in 1,000 categories
  • These classes do not directly apply to object detection on escalators. However, just as with humans, past knowledge helps when learning new topics
  • Training an object detector from scratch requires an immense amount of data and computing power
  • Fine tuning saves valuable time and can be just as effective as training from scratch
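With the TensorFlow Object Detection API, fine tuning is configured rather than coded: the pipeline.config file points at the downloaded checkpoint and overrides the class count. The excerpt below is representative only; the path and step count are placeholders, not this project's actual values.

```
model {
  faster_rcnn {
    num_classes: 2  # person and luggage, replacing the original classes
    ...
  }
}
train_config {
  fine_tune_checkpoint: "faster_rcnn_inception_v2_coco/model.ckpt"  # placeholder path
  num_steps: 200000  # placeholder
  ...
}
```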

Gathering Input Data

  • Data was taken from one security camera over a period of 1-2 days
  • Features a variety of situations

Preprocessing Data

  • Initially 2,669 images
  • Many images did not contain people or objects, so they were not included for training
  • Ended up with 612 total labelled images
  • Images were resized to 720 x 720 to improve training time with Faster R-CNN Inception V2
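The resize step itself is a one-liner in an image library such as PIL or OpenCV; a dependency-free nearest-neighbour version makes the index arithmetic explicit. The 1280x720 input size matches the camera frames recorded in the annotations.

```python
import numpy as np

def resize_nearest(img, out_h=720, out_w=720):
    """Nearest-neighbour resize: a minimal stand-in for the library resize
    (PIL, OpenCV, ...) used to bring frames to 720x720 for training."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source column for each output col
    return img[rows][:, cols]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # raw camera frame size
print(resize_nearest(frame).shape)  # (720, 720, 3)
```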

Labelled Data

  • 487 labelled training images
  • 125 labelled test images

Labelling Process

  • Software used: LabelImg
  • Objects we wanted to detect were labelled with a bounding box and the class label.
  • These annotations were stored as XML files with every image having a corresponding XML file.

XML File

<annotation>
    <folder>train</folder>
    <filename>escalator_122.jpg</filename>
    <path>C:\Users\aya708\Desktop\TensorFlow\workspace\escalator_object_detection\images\train\escalator_122.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>1280</width>
        <height>720</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>luggage</name>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>368</xmin>
            <ymin>246</ymin>
            <xmax>410</xmax>
            <ymax>306</ymax>
        </bndbox>
    </object>
    <object>
        <name>luggage</name>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>357</xmin>
            <ymin>289</ymin>
            <xmax>390</xmax>
            <ymax>338</ymax>
        </bndbox>
    </object>
    <object>
    ...
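Reading these annotations back for training is a short standard-library exercise. The snippet below embeds a trimmed copy of the annotation above and extracts each labelled box:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the Pascal VOC annotation shown above (written by LabelImg).
XML = """<annotation>
    <filename>escalator_122.jpg</filename>
    <size><width>1280</width><height>720</height><depth>3</depth></size>
    <object>
        <name>luggage</name>
        <bndbox><xmin>368</xmin><ymin>246</ymin><xmax>410</xmax><ymax>306</ymax></bndbox>
    </object>
    <object>
        <name>luggage</name>
        <bndbox><xmin>357</xmin><ymin>289</ymin><xmax>390</xmax><ymax>338</ymax></bndbox>
    </object>
</annotation>"""

def parse_voc(xml_text):
    """Extract (class, xmin, ymin, xmax, ymax) tuples from one annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((name, *(int(bb.findtext(tag))
                              for tag in ("xmin", "ymin", "xmax", "ymax"))))
    return boxes

print(parse_voc(XML))
# [('luggage', 368, 246, 410, 306), ('luggage', 357, 289, 390, 338)]
```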

During Training: Loss Graphs

Total Loss

RPN Localization Loss

RPN Objectness Loss

Examples of Detection (Luggage Only):

[Figures: detected luggage with bounding boxes, drawn by the model]

Examples of Detection (People and Luggage)

[Figures: detected people and luggage with bounding boxes, drawn by the model]

Expanding Applications of this Project:

  • Visualizing traffic against time at escalators
  • Patterns

  • Peaks at 12:00, 1:00, 3:00, between 4:00 and 5:00, and 6:00
  • Between 4:00 and 5:00, person count reaches its absolute maximum
  • Sinusoidal pattern: the amplitude is roughly half the difference between the minimum and maximum values of the wave
  • Graph shows a non-monotonic trend; person count is neither always decreasing nor always increasing

Distribution of Data:

  • Person count distribution is skewed right (positively skewed), meaning most observations are small or medium with only a few larger than the rest
  • The right tail shows that there is a small number of instances where person counts reach high values (these are distributed along the long "tail")
  • The left side, representing smaller person counts, is much taller, meaning instances of lower counts occur far more often
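Right skew can be checked numerically with the sample skewness, which is positive when the right tail is longer. The counts below are illustrative, not the project's data:

```python
def skewness(xs):
    """Sample skewness: positive when the right tail is longer."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 3 for x in xs) / n / var ** 1.5

# Illustrative person counts: mostly small values plus a few large outliers.
counts = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 15]
print(skewness(counts) > 0)  # True: right-skewed, matching the histogram
```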

Autocorrelation Plot

  • Correlation for time series observations can be calculated against observations from previous time steps, referred to as "lags"
  • This type of correlation is referred to as autocorrelation: the correlation of the series with itself at various time lags. For example, at lag 1, the correlation is between the variable at time t and at time t-1
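A minimal lag-k autocorrelation, applied to a made-up periodic count series (not the project's data), shows the expected sign pattern: positive at the full period, negative at the half period.

```python
import math

def autocorr(series, lag):
    """Correlation of a series with itself shifted back by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

# Made-up hourly person counts with a 24-step repeating pattern.
counts = [10 + 5 * math.sin(2 * math.pi * t / 24) for t in range(48)]
print(autocorr(counts, 24) > 0)   # True: positive at the full period
print(autocorr(counts, 12) < 0)   # True: negative at the half period
```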

Citations

Images of Faster R-CNN Architecture taken from: https://tryolabs.com
Fine tuning diagram: http://kcail.com/